Analyzing and Visualizing Loan Data

by (Junwei Tan)

Dataset Detail

This dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This data dictionary explains the variables in the data set. I am not explore all of the variables in the dataset! Only focus exploration on about 10-15 of them.

Steps of Analysis

Preliminary Wrangling / Data Cleaning /Data Structure / Data Distribution Status

Data Cleaning

here are the list for those coloum has null value

counts and see the data structure.
change the datatype to correct type
find out whether the data is distribued or not

What is the structure of you dataset ?

The loan dataset(after cleaning) contains 83923 observations of 86 variable.

What are the main factors/features of interest in this dataset ?

The purpose of the report has two focus

First: The bank needs a pre-approve model to determine whether the application will fall in 'bad loan' category, what is the potential return from this loan. Hope this report can provide a basic reference for this pre-approve system no matter from the model or basic data analysis

Second: I hope this report can be used as a reference for periodic business summary, to review business situation from a macro or micro perspective, and find out the factors that affect the business (where, who, when, what), so the management can adjust the in time.

1. What factors affect the estimate return,

2. What factors caused the bad loan.

By looking at the data, i selected the columns below for use to analyze :
- EstimateReturn, 
- Term, 
- LoanStatus, 
- EmploymentStatus, 
- CurrentDelinquencies,
- IncomeRange, 
- StatedMonthly Income, 
- Loan Original, 
- LoanOriginal Date.

Univariate Exploration

I am going to run some counts plots for better understanding the structure for each category.Let's count the feature which i am interest from data

the result shows the bad loan ratio cpmparely low, the loan still active has almost 67% of total loan

other occupations has the most counts compare to the rest, but it doesn' help the bank to identifiy those customer, more detal needed when KYC cusotmers in the future.

according to census depart, the median income from 2008 to 2014 was list above, so the median fo those 7 years is $55820. in conclution, almost 60% of loan were low or median income range, they are the core cusotmers. 34% are high income customers

california has more than 12% percentage counts , the top 5 big states has more than 37% percentage counts

Around 70% loan are committed to 36 months.

Prosper Rating C has most customers, Almost 22%

the above 3 plot trying to demonstrade the listing loan, the listing number go up chronologically. q4 2013 had meet tht highest number

Different types of data can vary greatly. For example, some states's(likes California,Texas)loan originated total number are way more than others. Also like certain loan with specifit occupations type(like:Other) also has large portions than others. So the plots listed above give us a clearer picture to review and select our data. Here are the keywords of the important data categories.

Bivariate Exploration

Now let’s dive in to find out the factors/feature that affect the loan outcome, and the outcome affecting the estimate return.

1. Here are the features I will pick up for revieing the correlation with the outcome of bad/good loan. Ownership/original amount, current delinquent/prosper rating / income/Debt to income ratio.

2. Here are the features I selecting for reviewing the correlation with estimate return. State, occupations, prosper rating, Loan Listing time.

Estimate Return and Loan Status are the most pay attention features i will focus on,so the other factors will apply to Estimate income and Loan Status

Prosper Rating C leading number for the current existing customer, but suprisely almost 1/3 current loan are poor estimate loan. but if your loan is chargedoff or completed, those loan are higher chance good earning loan

Prosper Rating C D and E has higher Estimate Return compare to other rating. defaulted and charged off status have higher estimate return if you are in the same prosper rating category.

occupation doesn't affect the estiamte income

eventhou i slit the performance into only two level, it telling us loan estimate incom doesn't relate with what customers occupations. good earning and not good earning half and half

the good/bad loan ratio distributed same among all occupations.

Prosper Rating C leading all others rating in all occupations. but the distrubued ratio is almost same for all occupations

Here is the plot have there different feature togther, (loan status/occupations/estimate return). we can see the correlation for those three features. again, occupations has no much impact to loan status and restimaed return.

i am going see the correlation amuont time and loan status/ estimate return

current loan has more precentage in all kind of rating, it make sence because current loan status cusotmer's number is much higher than other loan status

In Current Loan, more than 25% cusotmer has 4.0 rating, more than 20%(of it's total rating precentage) has 5.0 rating and 18% 6.0 rating, much higher than the loan in chargeoff, past due and defaulted.

Current, Completed,Chargedoff and Defaulted are the majority data of the loan dataset (more than 90%) . Current loan has almost 69%

The most current loan was skyrocketing from 2013, current loan status cusotmer has more than 66% percentage of total loan, means bank business skyrocketing after 2013.

time impact LoanStatus obviously since 2011,most of the loan last to current was started from 2011,it going up tremendously after 2013.

during the financial crisis in 2009, there is barely any loan record. there is a peak in 2011 suddenly lot of loan were completed.

it show there was a dip of good loan in 2013 but skyrocketing after then. bad loan not changing much for all time.

surprisingly good loans are positively correlated with low estimate earning.

apparently good loan are outnumber bad loan. original amount with 10k and 15k are the most common orignal amount.

eventhough 'good loan' has significant large number than 'bad Loan', the plot above can the original amount for good loan and bad loan

originaed loan total increased dramatically after q2 of 2013. bad loan ratio are decreased.

Q2 and Q3 have the most originated total count and also have the higest average Estimate return

Query the count for only in Q2 and Q3 for each years, because the q2 and q3 has the strongest origination Number

here we can see the mean different between good/bad loan in all columns.

the average INDEX Between Bad Loan and not Bad Loan

Good Loan or not bad loan has lower APR and Lower estimate loss by average. but bad loan actually has higher estimate return than good loan by average.

The average Estimate Return between good Loan and not good loan

plt.figure(figsize=(11.69, 8.27)) axes = earnmean.plot.bar(rot=0, subplots=True)

Multivariate Exploration

As i mentioned above, I hope this report can be used as a reference for periodic business summary, to review business situation from a macro or micro perspective, and find out the factors that affect the business (where, who, when, what), so the management can adjust the in time.

so here i tried to find out 'when','where','what' affect the estimate return or loan status.

Multiple histograms giving us the pictures of data distrution for different importance features.

Try to answer the question 'when'

plot in a mutivariate plot, try to see what is the trend(time)of estimate return between Bad Loan or not Bad Loan

Since the mid term of 2011, the red dot'not good loan' are cover by the good loan. Good loan are outnumber bad loan since then.

California has a very large number of loans compared with other states.

Plot in the californbia estimate return trend for bad/not bad loan, because california has the largest laon compare to the other states. Good loan out number bad loan after mid term of 2013

Here looking for the answer 'where', does location correlate with the loan ?

same heat plot, shorten income range to 4 catergory, viewing the plot in different angle.

This is a verification for the heatplot data, The average APR in Major business involved state who's income stated was "0"IL and TX customer enjoy very low if their state income is 0

Very Suprise(According to the chart) that IF YOUR STATE INCOME IS NONE IN IL or TX, YOU ACTUALLY CAN GET A VERY LOW APR, in california,there is a totally different story. But need a fact check like whether the data has outlier

There are only 5 countd for 0 income, 0 income at TX and IL are outlier, the heatplot is misleading

zero balance is outlier

Good/Bad loan are distrubuted, no outlier

need to be carefully when plotting data, make sure data is clean, make sure data being process adequately and accurately.

seems no different after the outlier have been removed for plot 1 / categorized income range into 4 level

the rows '0' , 'not displayed' and 'not emploted' are outlier data. The APR is correlate with income, APR going lower by follwering higher Income Range,same on the contrast.

Low income/ bad loan catogorized group are actually have better EstimatedReturn compare to High income/Good Loan group, no matter which states you currently located.

the pairgrid try to answer 'what' feature affect the estimate return

PAIRGRID is a very useful and dynamic plot function which you can plot in dataset in multiple variable at the same time.

Pairgrid telling me 'IncomeRange' not affecting the loanestimated return, But the APR Range and Term does correlate with the Estimate Return . High APR or median APR with 60 or 36 month term has low income are the key factor affect bank estimate income to higherIt suprise me that HighAPR and Low Income are the factor make Estimate Return higher.

We know bad loan has average higher estimate return than good loan, probably bank will charge those bad loan group customer more fee in order to cover the risk, but once loan become charge off or default loan, bank will suffer lost. so the second plot only remains current, complete or finial payment process loans (good loan) so we can tell what other factor affect the estimate income.

Low APR and term of 12 months(too short loan term) are the keey factors that affect the estimate income. Borrowers 's income are not quite correlated.

Categorized the loans the bank try to avoid. and figure out what causing the those loan. They are:

'Charged Off' and 'Default'

A loan becomes “charged off” when there is no longer a reasonable expectation of further payments. Charge off typically occurs when a loan is 120 days or more past due. Learn more about the difference between “charge off” and “default”.

A Loans that are in "Default" are loans for which borrowers have failed to make payments for an extended period of time.A loan becomes “Charged Off” when there is no longer a reasonable expectation of further payments. Charge Off typically occurs when a loan is 120 days or more past due and there is no reasonable expectation of sufficient payment to prevent the charge off. In certain circumstances, loans may be charged off at an earlier or later date. Please note, loans for which borrowers have filed for bankruptcy may be charged off earlier based on the date of bankruptcy notification.

plot in data into pointplot, seeing the correlation between originalmaount, APR range in different income level

calculate the precentage of declins loan in each APR group

Very highAPR APR Group has around 26% chance fallin declines loan, compare to HighAPR 24%, MdianAPR 12% and lowapr 3%. So higher the APR, higher chance Declins Loan

calculate the precentage of declins loan in each income range

'Income' correlate with the 'precentage of become declines loan', the higher income, the lower chance become declins loan. if customer's income larger than100k, there is less than 5% chance become declines loan , but those not employed cusotmer, declins loan chance is 5 times than the 100k above customers.

'Declines loan' normally has higher APR than 'Non declines loan', more obvious between $1-$100k income range

Debt to Income ratio are not much different but only can see the different in the income range under 25k.

'income' and 'APR' are the factors that affect the loan status, most declines loan has lower monthly income and higher APR compare to undeclines loan, when Prosper Score is between range 2-6 and income are close or below 4k, there is high probability declines loan will happens

Summary and Insights

As I mentioned before there are two major purpose I studying this dataset, find the correlation causing default and charge off loans, find out the features (who,where,when,what) affects the estimate return.

  1. Here are the key words for most of the loans (CA, Current, Term 36, Income $50000-$74999, 2013, Q2,Q3. ProsperRating(C,D), Occupation: Other)
  2. Answer(When)2013 is the turning point for loan business, the loan listing amount skyrocketing, the major listing Loan (current, complete) go up dramatically ; Q2 and Q3 are more busy quarter compare to q1 and q4 by average.
  3. Answer(Who & Where)Customer’s Job title(who), Which state they living(where) does not affecting their loan status or estimate income.
  4. Answering(What) By average speaking, Bad loan has higher average estimate income than Good Loan. (what).
  5. Correlation: Income/APR(higher income, lower APR); Apr/Estimate Return( High APR,Higher Return) ; Term/Return (Longer Term, Higher Return).
  6. Estimate Income ( lower income, higher estimate return, who is your occupation or when you live doesn’t matter)‘income' and 'APR' are the factors that affect the loan status, most declines loan has lower monthly income and higher APR compare to not declines loan, when Prosper Score is between range 2-6 and income are close or below 4k, there is high probability declines loan will happen.

  7. Higher Income and Lower APR, more chance not lead to charge off and default loan.

Key Insights for Presentation:

There is contradiction between avoids bad loan and maximize estimate return. Because the bad loan has higher estimate average income 0.12 than good loan 0.09. and other plots are verify my finding. I guess the reason is bank charge more fee or interest for those high APR and low-income customer in order to cover the potential bad loan lost, that causing the bad loan average estimate return higher than good loan. But that doesn’t mean we don’t avoid bad loan, because if bank not handle charge/default loans properly, it will turn to lost, even though estimate return is higher but actually return are badly hurt because the bank spend more resource to handing the loan.

i know there are some thought, viewpoint and perspectives are not being consider, also some feature not being used in this report. I also believes there are more better skill or theory that I haven’t been acknowledged. This report will remind and inspire me to keep learning

Markdown for Jupyter notebooks Cheatsheet ,

Python Pandas Value_counts() tricks that make your work more efficient

isin(),str.contains()

Ufuncs

Commenting Python Code

Discussion:Storytelling and Success Stories

Five Rules of Data Exploration

15 Data Exploration techniques to go from Data to Insights

Graphic Collection

5 Python Libraries for Creating Interactive Plots